cluster 0
Explainable AI for Curie Temperature Prediction in Magnetic Materials
Ajaib, M. Adeel, Nasir, Fariha, Rehman, Abdul
Traditional approaches based on quantum mechanical computations or empirical models are often limited in scalability and accuracy. In recent years, machine learning (ML) has emerged as a promising alternative for property prediction across materials science domains [1-9]. Building on this momentum, several recent studies have proposed the use of ML models trained on curated magnetic datasets. In particular, the recent study [10] introduced the NE-MAD database, which aggregates experimentally measured magnetic transition temperatures and compositions. Similarly, the study by [11] utilized two of the largest available datasets of experimental Curie temperatures--comprising over 2,500 materials for training and more than 3,000 entries for validation--to compare machine learning strategies for predicting Curie temperature solely from chemical composition. Our work is inspired by these prior efforts and aims to improve the predictive accuracy and gain insights into model in-terpretability. We develop a pipeline that starts from the NE-MAD dataset, augments it with compositional and elemental features, and evaluates several ML models. A key contribution of our work is the integration of explainable AI (XAI) through SHAP (SHapley Additive exPlanations) analysis, which allows us to quantify how each input feature contributes to the model's prediction. Moreover, we benchmark our models on external datasets from literature to demonstrate generalization.
AttentiveGRUAE: An Attention-Based GRU Autoencoder for Temporal Clustering and Behavioral Characterization of Depression from Wearable Data
Soley, Nidhi, Patel, Vishal M, Taylor, Casey O
In this study, we present AttentiveGRUAE, a novel attention-based gated recurrent unit (GRU) autoencoder designed for temporal clustering and prediction of outcome from longitudinal wearable data. Our model jointly optimizes three objectives: (1) learning a compact latent representation of daily behavioral features via sequence reconstruction, (2) predicting end-of-period depression rate through a binary classification head, and (3) identifying behavioral subtypes through Gaussian Mixture Model (GMM) based soft clustering of learned embeddings. We evaluate AttentiveGRUAE on longitudinal sleep data from 372 participants (GLOBEM 2018-2019), and it demonstrates superior performance over baseline clustering, domain-aligned self-supervised, and ablated models in both clustering quality (silhouette score = 0.70 vs 0.32-0.70) and depression classification (AUC = 0.74 vs 0.50-0.67). Additionally, external validation on cross-year cohorts from 332 participants (GLOBEM 2020-2021) confirms cluster reproducibility (silhouette score = 0.63, AUC = 0.61) and stability. We further perform subtype analysis and visualize temporal attention, which highlights sleep-related differences between clusters and identifies salient time windows that align with changes in sleep regularity, yielding clinically interpretable explanations of risk.
- North America > United States > Maryland > Baltimore (0.05)
- Asia > Nepal (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.94)
User Profiles of Sleep Disorder Sufferers: Towards Explainable Clustering and Differential Variable Analysis
Sellami, Sifeddine, Agoun, Juba, Yessad, Lamia, Bounia, Louenas
Sleep disorders have a major impact on patients' health and quality of life, but their diagnosis remains complex due to the diversity of symptoms. Today, technological advances, combined with medical data analysis, are opening new perspectives for a better understanding of these disorders. In particular, explainable artificial intelligence (XAI) aims to make AI model decisions understandable and interpretable for users. In this study, we propose a clustering-based method to group patients according to different sleep disorder profiles. By integrating an explainable approach, we identify the key factors influencing these pathologies. An experiment on anonymized real data illustrates the effectiveness and relevance of our approach.
- Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
- Asia (0.04)
- Health & Medicine > Therapeutic Area > Neurology (0.69)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.55)
Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models
Bonil, Gustavo, Gondim, João, Santos, Marina dos, Hashiguti, Simone, Maia, Helena, Silva, Nadia, Pedrini, Helio, Avila, Sandra
This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colo-nially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach, that combines machine learning techniques with qualitative, manual discourse analysis.
Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
Nagori, Aditya, Gautam, Ayush, Wiens, Matthew O., Nguyen, Vuong, Mugisha, Nathan Kenya, Kabakyenga, Jerome, Kissoon, Niranjan, Ansermino, John Mark, Kamaleswaran, Rishikesan
Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation(LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
- North America > Canada > British Columbia > Vancouver (0.05)
- Africa > Uganda > Western Region > Mbarara District (0.05)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Comparing Self-Disclosure Themes and Semantics to a Human, a Robot, and a Disembodied Agent
Chiang, Sophie, Laban, Guy, Cross, Emily S., Gunes, Hatice
As social robots and other artificial agents become more conversationally capable, it is important to understand whether the content and meaning of self-disclosure towards these agents changes depending on the agent's embodiment. In this study, we analysed conversational data from three controlled experiments in which participants self-disclosed to a human, a humanoid social robot, and a disembodied conversational agent. Using sentence embeddings and clustering, we identified themes in participants' disclosures, which were then labelled and explained by a large language model. We subsequently assessed whether these themes and the underlying semantic structure of the disclosures varied by agent embodiment. Our findings reveal strong consistency: thematic distributions did not significantly differ across embodiments, and semantic similarity analyses showed that disclosures were expressed in highly comparable ways. These results suggest that while embodiment may influence human behaviour in human-robot and human-agent interactions, people tend to maintain a consistent thematic focus and semantic structure in their disclosures, whether speaking to humans or artificial interlocutors.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- Europe > Switzerland (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Scalable Robust Bayesian Co-Clustering with Compositional ELBOs
Vinod, Ashwin, Bajaj, Chandrajit
Co-clustering exploits the duality of instances and features to simultaneously uncover meaningful groups in both dimensions, often outperforming traditional clustering in high-dimensional or sparse data settings. Although recent deep learning approaches successfully integrate feature learning and cluster assignment, they remain susceptible to noise and can suffer from posterior collapse within standard autoencoders. In this paper, we present the first fully variational Co-clustering framework that directly learns row and column clusters in the latent space, leveraging a doubly reparameterized ELBO to improve gradient signal-to-noise separation. Our unsupervised model integrates a Variational Deep Embedding with a Gaussian Mixture Model (GMM) prior for both instances and features, providing a built-in clustering mechanism that naturally aligns latent modes with row and column clusters. Furthermore, our regularized end-to-end noise learning Compositional ELBO architecture jointly reconstructs the data while regularizing against noise through the KL divergence, thus gracefully handling corrupted or missing inputs in a single training pipeline. To counteract posterior collapse, we introduce a scale modification that increases the encoder's latent means only in the reconstruction pathway, preserving richer latent representations without inflating the KL term. Finally, a mutual information-based cross-loss ensures coherent co-clustering of rows and columns. Empirical results on diverse real-world datasets from multiple modalities, numerical, textual, and image-based, demonstrate that our method not only preserves the advantages of prior Co-clustering approaches but also exceeds them in accuracy and robustness, particularly in high-dimensional or noisy settings.
- North America > United States > Texas > Travis County > Austin (0.14)
- North America > United States > Wisconsin (0.04)
- Europe > Germany > Bavaria > Regensburg (0.04)
- Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
- Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (0.94)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework
Rahal, Manal, Ahmed, Bestoun S., Szabados, Gergely, Fornstedt, Torgny, Samuelsson, Jorgen
Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it is deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of anti-sense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.
- Europe > Sweden > Värmland County > Karlstad (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (2 more...)
Fuel Efficiency Analysis of the Public Transportation System Based on the Gaussian Mixture Model Clustering
Ma, Zhipeng, Jørgensen, Bo Nørregaard, Ma, Zheng
Public transportation is a major source of greenhouse gas emissions, highlighting the need to improve bus fuel efficiency. Clustering algorithms assist in analyzing fuel efficiency by grouping data into clusters, but irrelevant features may complicate the analysis and choosing the optimal number of clusters remains a challenging task. Therefore, this paper employs the Gaussian mixture models to cluster the solo fuel-efficiency dataset. Moreover, an integration method that combines the Silhouette index, Calinski-Harabasz index, and Davies-Bouldin index is developed to select the optimal cluster numbers. A dataset with 4006 bus trips in North Jutland, Denmark is utilized as the case study. Trips are first split into three groups, then one group is divided further, resulting in four categories: extreme, normal, low, and extremely low fuel efficiency. A preliminary study using visualization analysis is conducted to investigate how driving behaviors and route conditions affect fuel efficiency. The results indicate that both individual driving habits and route characteristics have a significant influence on fuel efficiency.
- Europe > Denmark > North Jutland (0.25)
- Europe > Poland (0.04)
- Europe > Germany (0.04)
- Europe > Denmark > Southern Denmark (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (0.47)
Cluster Catch Digraphs with the Nearest Neighbor Distance
Shi, Rui, Billor, Nedret, Ceyhan, Elvan
We introduce a new method for clustering based on Cluster Catch Digraphs (CCDs). The new method addresses the limitations of RK-CCDs by employing a new variant of spatial randomness test that employs the nearest neighbor distance (NND) instead of the Ripley's K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, comparable to or outperforming KS-CCDs and RK-CCDs that rely on a KS-type statistic or the Ripley's K function. We also evaluate our methods using real and complex data sets, comparing them to well-known clustering methods. Again, our methods exhibit competitive performance, producing high-quality clusters with desirable properties. Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, The nearest neighbor distance, Spatial randomness test
- North America > United States > New York > Richmond County > New York City (0.04)
- North America > United States > New York > Queens County > New York City (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (9 more...)